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Complex interactions between genes or pro- 
teins contribute a substantial part to pheno- 
typic evolution. Here we develop an evo- 
lutionarily grounded method for the cross- 
species analysis of interaction networks by 
alignment, which maps bona fide functional 
relationships between genes in diff"erent organ- 
isms. Network alignment is based on a scor- 
ing function measuring mutual similarities be- 
tween networks taking into account their in- 
teraction patterns as well as sequence similar- 
ities between their nodes. High-scoring align- 
ments and optimal alignment parameters are 
inferred by a systematic Bayesian analysis. 
We apply this method to analyze the evolu- 
tion of co-expression networks between human 
and mouse. We find evidence for significant 
conservation of gene expression clusters and 
give network-based predictions of gene func- 
tion. We discuss examples where cross-species 
functional relationships between genes do not 
concur with sequence similarity. 

Besides a wealth of genomic sequence information, 
m.olecular biology is accumulating more and more 
data probing the interactions between genes or pro- 
teins. Examples are regulatory interactions, where 
the expression level of one gene influences the ex- 
pression of another gene, or interactions between pro- 
teins, where pairs of proteins bind to form dimcrs or 
multimers. Interactions between genes or their prod- 
ucts are crucial for our understanding of biological 
functions. With the advent of experimental high- 
throughput methods, large-scale datasets of different 
organisms are becoming available, which can be an- 
alyzed by systematic cross-species comparison. 

This paper is devoted to developing an evolution- 
ary rationale for biological network analysis. Since 



the interactions between genes are encoded in their 
genomic sequences, this may seem a rather straight- 
forward generalization of established concepts in se- 
quence analysis: evolution acts as a divergent pro- 
cess on the constituents of the network, which gradu- 
ally reduces cross-species correlations of the network 
structure. Detecting these correlations requires an 
alignment procedure which can map functional units 
as network structures conserved by evolution as well 

as estimate the degree of divergence between species. 
However, interaction networks evolve in a more 

heterogeneous and a more correlated way than se- 
quences, which makes their cross-species comparison 
a considerably more challenging task. The interac- 
tions between proteins, for example, depend on the 
properties of a specific functional binding domain, 
which may evolve in a different way than the remain- 
der of the protein sequence, with correlations to its 
binding domain in a different protein. Regulatory in- 
teractions can change by the evolution of regulatory 
DNA, which is expected to be different from that 
of coding DNA. Moreover, many sequence changes 
in a gene may be irrelevant for its interactions mea- 
sured in a network. This leads us to treat the evolu- 
tion of the interactions within a network, i.e., its link 
dynamics, and the overall sequence evolution of its 
constituents, the node dynamics, as two independent 
modes of evolution. We describe these modes by sim- 
plified stochastic models and infer their relative con- 
tribution to network evolution by cross-species com- 
parison. This dynamics is also quite heterogeneous 
across the network, and we use the models of link 
and node dynamics to quantify the evolutionary con- 
servation of putatively functional network modules. 

Our evolutionary analysis is based on the align- 
ment of networks, i.e., a mapping between their 
nodes, which also induces a mapping of their links. 
In the Theory part of this paper, we develop a sta- 



tistical theory of alignment for biological networks. 
We introduce a scoring designed to detect local func- 
tional correlations, which uses both the similarities 
of mapped link pairs and of node pairs. This scoring 
derives from the underlying link and node dynamics. 

Various alignment and scoring procedures for bi- 
ological networks have been discussed in recent ar- 
ticles. One type of methods restricts the alignment 
to mutually homologous nodes, i.e, gene pairs with 
significant sequence similarity in different species. In 
this way, clusters of conserved interactions have been 
found in gene co-expression networks ^ |2| and in 
protein interaction networks O |3]. A complemen- 
tary approach is to align networks only by their link 
overlap, independently of node homology. Network 
motifs El defined by families of mutually similar 
subgraphs in a larger network have been identified in 
this way [3| as well as the similarities between regu- 
latory networks of different phages [S] . These meth- 
ods have been combined with their relative weights 
fixed ad hoc in ref. [S]. A third method called Path- 
blast |1(J[ [TT] evaluates the link similarity between 
networks along paths of connected nodes, using se- 
quence alignment algorithms. It has been applied 
to cross-species comparisons of protein interaction 
networks ^U). Similarly, the flux along the short- 
est paths in regulatory networks has been compared 
across species jHJ . Metabolic networks with few cycles 
have been analyzed by subtree comparison |12| . 

From an evolutionary point of view, these methods 
are heuristics containing different assumptions on the 
underlying link and node dynamics. Homology-based 
alignments are appropriate if the sequence divergence 
between the species compared is sufficiently small 
so that all pairs of functionally related nodes can 
be mapped by sequence homology. However, genes 
with entirely unrelated sequence may take on a simi- 
lar function in different organisms, and hence have a 
similar position in the two networks. (Such so-called 
non-orthologous gene displacements are well-known 
in metabolic networks |18l 1141 I15j.'l On the other 
hand, alignments by link similarity alone altogether 
ignore the evolutionary information of the node se- 
quences. Path-based alignment algorithms are well 
suited to networks with predominantly linear biolog- 
ical pathways such as signal-transduction chains. In 
other situations, however, it may be difficult to link 
the scoring parameters to evolutionary rates of link 
and node changes. 

The alignment method presented in this paper is 



grounded on statistical models for the evolution of 
links and nodes. Alignments are constructed from 
link and node similarity treated on an equal foot- 
ing, the relative weight of these score contributions 
is determined systematically by a Bayesian param- 
eter inference. Nodes without significant sequence 
similarity are aligned if their link patterns are suf- 
ficiently similar. Conversely, nodes are not aligned 
despite their sequence similarity, if their links, and 
hence their putative functional role, show a strong 
divergence between the two networks. Our method 
is rather general and can be applied both to net- 
works with binary link strengths (as in the current 
large-throughput data for protein interactions) and 
to networks with continuous link strength (such as 
the co-expression data used in this study). 

As an algorithmic problem, network alignment is 
clearly more challenging than sequence alignment, 
which can be solved by dynamic programming |16l 
I17| . Already simpler problems such as matching two 
graphs by determining the largest common subgraph 
are A''P-hard ^S], which implies there is probably 
no polynomial-time algorithm. We have developed 
an efficient heuristic, by which network alignment is 
mapped onto to a generalized quadratic assignment 
problem, which in turn can be solved by iteration of 
a linear problem jl^. 

In the second part of the paper, we present a cross- 
species comparison of co-expression networks of H. 
sapiens and M. musculus as an example application 
of our method. In this type of networks, the link be- 
tween a pair of genes is given by the correlation co- 
efficient of their expression profiles measured on an 
mRNA microarray chip. We show that correlation 
networks are well-suited for cross-species comparison: 
they are robust datasets even if individual expres- 
sion levels cannot be compared with each other since 
the experimental conditions differ between species. 
The evolution of these networks results from the evo- 
lution of regulatory interactions between genes and 
from loss and gain of genes. High-scoring alignments 
between expression networks in human and mouse 
provide a quantitative measure of divergence between 
the two species. We find conserved network struc- 
tures, related to clusters of co-expressed genes; simi- 
lar findings are reported in refs. |T] and ^j. However, 
the alignment found here differs from mere sequence 
homology. This leads to network-based predictions of 
gene functions, including functional innovations such 
as non-orthologous gene displacements. 



Theory 

Graphs and graph alignments. A graph A is a 
set of nodes with links between pairs of nodes. The 
graphs considered here are labeled by gene name, 
which is denoted by the node index i = 1, . . . , Na, 
and are thus uniquely represented by the adjacency 
matrix a = {an'). A graph is called binary if links 
are either absent [an' — 0) or present [an' = 1) 
and weighted if the link strengths an' take continuous 
values. The special case of a symmetric adjacency 
matrix is used to describe undirected graphs. For 
example, current high-throughput datasets of pro- 
tein interactions, which do not specify the interaction 
strength, produce binary undirected graphs. Gene 
expression networks, whose links denote the mutual 
correlation coefficient between expression patterns of 

two genes, are weighted graphs with — 1 < aw < 1. 
A local alignment between two graphs A and B 

is defined as a mapping tt between two subgraphs 
A C A and B = t:{A) C B as shown in Fig. [Ja). 
The alignments of networks discussed in this paper 
are designed to display cross-species functional rela- 
tionships between aligned node pairs. Due to gain or 
loss of genes in either species, not every gene in one 
network has a functional equivalent in the other, and 
the alignment algorithm has to determine the aligned 
subnetworks A and B with significant correlations. 
For the sake of algorithmic simplicity, we will discuss 
here only one-to-one mappings tt, which is appro- 
priate for most gene pairs but neglects multi-valued 
functional relationships induced for some genes by 
gene duplications. 

Link dynamics and link score. An important 
statistical characteristic of a network is the link dis- 
tribution p^{a), giving the probability that the link 
between a randomly chosen pair of nodes takes on 
the value a. The evolution of the link distribution 
can be modeled by a simple stochastic process, from 
which our link similarity scoring of an alignment is 
derived. In a binary network, the simplest form of 
link dynamics is a Markov process, which is fully de- 
termined by the rates of formation and loss of single 
links. Generalizing this dynamic to continuous links 
leads to a diffusion equation of the form 



dtp'{a)^[djg{a)-daf{a)]/{a). 



(1) 



The two terms on the r.h.s. describe the stochastic 
turnover and the average relaxation of links with co- 
efficient functions g{a) and /(a), respectively. For 
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Figure 1: Netw^ork alignments measure link and 
node similarity, (a) A local alignment n between two 
networks A,B is a one-to-one mapping (indicated by 
dashed lines) between nodes of the subsets A, 13. (b) The 
local link score Sf^jj) evaluates all pairwise similarities 
between links aai and 6,r(i)7r(i') (solid lines) for a given 
pair of aligned nodes, (c) The local node similarity score 
S^^/j) evaluates the overlap of the alignment with the 
node similarity &i,^(i) (dotted line). Top to bottom: 
Aligned node pairs (i) without similarity to any other 
node, (ii) with mutual node similarity, (iii) with (at least 
one) node similarity mismatch. 



mutual expression correlations between two genes in 
a microarray, this form can be derived from a stochas- 
tic model for loss and gain of regulatory interactions, 
each of which affects a random subset of the exper- 
iments, resp. cell types. The cross-species correla- 
tions in pairs of evolutionarily related links a, b are 
contained in the joint distribution q^{a,b), which we 
write in the form 



q\a,b) ^ p^ia) psib) exp[s'^ {a,b)], 



(2) 



defining the log-likelihood link similarity score 
s^{a,b). For binary links, this has a bilinear form 



s^{a, b) = Xiab + a^a + cr^b + const. 



(3) 



with the link match reward Af depending on the evo- 
lutionary distance between the species. The additive 
constant is given by the normalization of the proba- 
bility distributions in |(5|. For continuous links, we 
write the joint distribution as q^{a, b) = G(6|a)p^(a), 
where G(b\a) is the conditional distribution of link 



strengths b evolved from an initial strength a over 
the evolutionary distance between the two species 
compared. For short evolutionary distances and 
for a link evolution of the form |Q, this distribu- 
tion is well approximated by a Gaussian, G{b\a) ~ 
exp[—X£g{^^) {a — b)"^]. For large evolutionary dis- 
tances, it can be shown that the score s^{a,b) = 
G{b\a)/p^^{b) has again the asymptotic form Q. 
Given datasets of two networks A and B, the distri- 
bution p\, pg can be estimated from the frequency 
of link strengths aw resp. bjj> in one species, and 
q from the frequency of link pairs {au' , bjj' ) involv- 
ing orthologous gene pairs {i,j) and {i',j')- Hence, 
the score function s^{a,b) defined by eq. Q can be 
inferred without specific assumptions on the underly- 
ing link dynamics. For the example discussed below, 
this empirical link score turns out to be in remark- 
able agreement with the form Q predicted by the 

link diffusion model. 

This scoring of individual link pairs is readily gen- 
eralized to pairs of networks {A, B) with a given lo- 
cal alignment tt. Assuming that aligned link pairs 
(oii', 6^(iw(i/)) follow the distribution q^{a,b) inde- 
pendently from each other and unaligned links aui 
and bjj' follow the distributions p^xi'^) ^^^ pg{b), re- 
spectively, we obtain the distribution of graph pairs 
for given tt, 

g^(a,b|7r) = Pi(a)P^(b)exp[5^(a,b,^)], (4) 

where F|(a) = Ili.j'eA Pa (""'); PbO^) has a similar 
product form, and the network link score ^^(a, b,7r) 

i,7r{i) 



is a sum of local contributions S*; „,,^ of aligned node 



pairs, 

S\a,h,7T) 



7 . ^i,7r{i) 



7 , S (aij',&7r(i)7r(i'))' 
i,i'GA 



(5) 
as shown in Fig. ^b). For co-expression networks, 
there are correlations between links within one net- 
work. These occur since the number of independent 
measurements, d, is smaller than the number of genes 
N, and are taken into account by the overall scale of 

the link score (i.e, X^ ^ a^ ^ d/N). 

The relative evolutionary conservation of a given 

pair (a, b) of aligned links within the network is mea- 
sured by its excess link score 

As\a,b) = s\a,b) - {{sHa,b'))„ + {/{a'Ma')/2, 

(6) 
i.e., the difference of its link score and the aver- 
age over all aligned link pairs with either strength a 



fixed, {s^{a,b'))b' = J db'G{b'\a)s^{a,b'), or strength 
b fixed. The relative conservation of link patterns be- 
tween a pair of aligned nodes i, 7r(i) is then given by 
the local excess link score 



AS: 



i,'7T{i) 



i'eA 



As\a. 



ii' 1 ^7r{i)7r{i') ) 



(7) 



These measures will become important for the iden- 
tification of network clusters and their evolutionary 
conservation. 

Node dynamics and node score. The pairwise 
similarity between genes in networks A and B is given 
by a matrix 0, whose entries 9ij quantify, for exam- 
ple, the overall sequence similarity between the gene 
sequences i d A and j e _B or a biochemical similarity 
between the corresponding proteins. The sequence 
similarity between functionally related genes decays 
over time due to local mutations, but is also affected 
by large-scale genomic events such as gene duplica- 
tions, gene loss, or recruitment of new genes into 
a functional context. Due to these processes, both 
networks contain a fraction of nodes with little or 
no significant sequence similarity to any node in the 
other network, which should nevertheless be included 
in the alignment if their local link score suggests sig- 
nificant functional cross-species relationships. More- 
over, functional swaps between genes induce func- 
tional correlations between genes that are unrelated 
by sequence and, at the same time, reduce corre- 
lations between other genes despite their sequence 
similarity. A prominent example is non-orthologous 
gene displacements I^^^^]. It is these processes 
that cause the network alignment to deviate for some 
nodes from a map based only on sequence homology. 
Functional swaps can be regarded as part of the link 
evolution, which in co-expression networks leads to 
coherent link changes at the affected nodes. How- 
ever, these swaps are likely to involve selection and 
are not captured by the independent link dynamics 
discussed above. Hence, we include them here as a 
separate type of process with its own evolutionary 
rate. 

The resulting statistics of node similarity can be 
described by the distribution of pairwise similarity 
coefficients between unaligned nodes, Po{d), between 
pairs of aligned nodes, g"(0), and between one aligned 
node and nodes other than its alignment partner, 
92 (^)- Note that Po{9) does not simply describe un- 
correlated sequences: significant sequence similarity 
may exist between genes that are not aligned due 



to their link mismatch, since functional changes can 
lead to a rapid divergence of links, for example, in 
the formation of a pseudogene. 

The log-likelihood node similarity scores s" (0) and 

32(0), which are defined by 

q^i9) - p"„{0) eM4m, 12(0) = Pm exp[s^(^)], 

(8) 
quantify the dependence of the alignment on node 
similarity. Assuming that the coefficients 6ij are 
drawn independently from these distributions, we ob- 
tain the distribution of node similarity for a pair of 
networks A and B with a given alignment tt. 



Q"(0K)=Po"(0)exp[^"(0,7r)] 



(9) 



where Pq{&) = Yit jPoi^ij) ^^"^ ^^^ network node 
score S'"(0,7r) is again a sum of local contributions 



,) and S2 



In this paper, we use a simple 



binary approximation of node similarity: two genes 
are counted as orthologous (Oij = 1) if they appear as 
putative orthologs in the Ensembl-database [20] , and 
otherwise not {9ij — 0). Each node may have sev- 
eral such putative orthologs. The three distributions 
in IJHI) are then fully determined by three model pa- 
rameters, p'^{e) ~ exp[Co6'], qi{0) ^ exp[(Co + A„)6'], 
and 92 (^) ~ 6xp[(('o + A'j)6'], which in turn depend on 
the rates of the node dynamics and on the evolution- 
ary distance between the species. A short calculation 
shows that the node score ® takes the form 



5"(©,^)=5](5i:, 



(■0 



f^ 



(10) 



ieA 



Here the local node similarity score 



ifE 



^i,Tr(i) = { ^n if ^i7r(i) — Ij 

A' otherwise 



E. 



ii£A ^i'-r:{i) 



(11) 

measures the overlap of alignment tt and homology 
map as shown in Fig. ^c), and the "chemical po- 
tential" /x(A„,A^,Co) implicitly determines the over- 
all number of nodes in the alignment (for details, see 
the Supporting Text). For large /i, the highest scores 
occur in global alignments between the networks A 
and -B, which involve all nodes of the smaller net- 
work. This is appropriate if the evolution of links 
and nodes maintains for all nodes some functional 
relationship within the network. In the case of this 
study, link and node dynamics destroy significant cor- 
relations for some nodes. We obtain local alignments 



with chemical potential /x < 0, which exclude some 

nodes of both networks. 

Hidden Markov model and Bayesian analysis. 

We can now combine the distributions Q^ and Q" 
into a probabilistic model for link and node similar- 
ity, which produces the observable data, i.e., pairs of 
networks with adjacency matrices a, b and node simi- 
larity matrix 0, for a given alignment tt and for given 
model parameters m = (s^, A„, A^, Co) in eqs. |(SJ| and 
(llUf) . The combined model is given by the probability 
distribution 

Q(a, b, 0|7r, m) = Q\&, b, 0|7r, m)g"(0|7r, m) (12) 
= exp[5(a, b, 0, ^, m)]Pi(a)P|(b)Po"(0, Co) 

with the alignment score function 

5(a, b, 0, TT, m) = ^^(a, b, tt, m) + 5"(0, tt, m). 

(13) 
Eqs. H12|l and \i'A\ are at the heart of our scoring 
procedure: they provide a probabilistic rationale for 
the cross-species analysis of network data by link and 
node similarity. The model parameters m, which de- 
termine the relative weight of link and node score, 
and the alignment tt are "hidden" variables, which 
can be inferred by a standard Bayesian analysis. We 
write their posterior probability, i.e., the conditional 
probability of the hidden variables for given data 
a, b, 0, in the form 



(9(7r, m|a, b, 0) 



Q(a,b, 0|7r,m)P(7r,TO) 
E,,„Q(a,b,0K,m)P(7r,7 



(14) 

and assume the prior probability P(7r, m) to be 
fiat. Dropping the terms independent of tt and 
m, we obtain the optimal local alignment tt* by 
maximizing the posterior probability Q(7r|a, b, 0) ^ 
^^ (5(a, b, 0|7r, TTi) and similarly the optimal scor- 
ing parameters m* by maximizing (5(?7i|a, b, 0) ^ 
^^ (5(a, b, 0|7r, to). In a Viterbi approximation, 
TT* and m* can be inferred jointly by maximizing 
Q(a, b, 0|7r, ?7i). This amounts to determining the 
optimal null model parameter Co a-nd maximizing the 
combined score 5(a, b, 0, tt, m). Details are given in 
the Supporting Text. 

Alignment algorithm. Our algorithm for maxi- 
mizing the score is based on a mapping to a gener- 
alized quadratic assignment problem, which is solved 
by an iterative heuristic similar to 19 with running 
times of order N^ |23 (for details, see Supporting 
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Figure 2: Performance characteristic of the align- 
ment algorithm. The fraction pout of correctly aligned 
nodes is plotted against the number of iterative steps n 
for fractions pin — 0.01 (diamonds), 0.02 (squares), 0.5 
(circles) of the node similarity given as input. Typically 
the algorithm converges after about 5 iterations. There is 
a switch from low to high alignment quality (pout > 0.9) 
at a threshold value pc ~ 0.02. 



Text) . To quantify the performance of the algorithm 
for co-expression networks, we have used a human 
microarray dataset [22| , consisting of expression mea- 
surements of different tissues. We randomly parti- 
tioned the experiments into two equally large subsets, 
and thus obtained two "mirror copies" of the expres- 
sion correlation network in one species. The nodes in 
the two networks are identical and their links differ 
only by experimental noise. The correct alignment 
of these two copies is trivial, 7r(i) — i. A fraction 
Pin of correctly aligned nodes with randomly chosen 
indices i is given as input for the algorithm by spec- 
ifying the corresponding node similarity coefficients 
(^ij = Sij, the remaining node information is ignored 
{6ij = 0). We then record the fraction of correctly 
aligned nodes pout (n) of the algorithm as a function 
of the number of iterations n, see Fig.|21 This perfor- 
mance characteristic shows a switch from low to high 
alignment quality at a threshold value pc ~ 0.02. In 
the low-quality regime (pin < Pc), the alignment con- 
tains for all n only the nodes given as input. In the 
high-quality regime (pin > Pc), the iterations continu- 
ously improve the fraction of correctly aligned nodes, 
saturating at an accuracy pout > 0.9 for large n. Of 
course, the threshold will be higher and the satu- 
ration accuracy lower for cross-species comparisons, 
where the networks differ by evolutionary changes 
and by larger experimental variation. Similarly, the 
threshold rises if the network is randomly diluted (to 
Pc ~ 0.2 when 95% of all links have been set to zero). 



Results 

Aligning human/mouse expression data. The 

co-expression networks were constructed from the ex- 
pression data of Su et al. |^ obtained from 79 tis- 
sues in human, 61 tissues in mouse, and a set of 
biological and technical replicates of the same size. 
Similar experimental protocols were used in both 
species, making the data particularly suitable for 
cross-species comparison. Our networks A (human) 
and B (mouse) of size Na = Nb = 2065 contain all 
genes which are expressed in all samples and show 
a low variance of expression levels across samples in 
both species (housekeeping genes) , as well as all genes 
having a high expression similarity with at least one 
such housekeeping gene. The link strength aai is de- 
fined as the Spearman correlation between the ex- 
pression levels of the human genes i and i' across all 
tissues, and similarly bjji in mouse. Both networks 
have a broad distribution of link values; the distri- 
bution p^(a) in human is shown in Fig. ^a). To 
determine the link scoring function s^{a,b), we look 
at all human gene pairs («,i') which have homologs 
{j,j') in mouse and compute the distribution of link 
pairs a = aw and b = bjji . The optimal alignment 
TT (along with the optimal node model parameters 
A„,A^,Co) is then inferred by likelihood maximiza- 
tion as described above; it consists of 1956 genes. 

The overall cross-species variation of expression 
is given by the root mean square difference A^ = 
y/{{a — b) ) (with the brackets (...) indicating the 
average over all aligned link pairs a,b), we find A^ = 
0.33. To separate this variation into evolutionary 
changes and sampling noise, we again construct co- 
expression networks from a randomly chosen subset 
containing half the expression measurements from ei- 
ther organism and obtain A^ = 0.35, i.e., sampling 
contributes only a small fraction to Af. The align- 
ment is also remarkably stable with respect to this 
change of the dataset: 85% of the nodes are aligned 
to the same partner. This shows that co-expression 
networks provide a faithful representation of evolu- 
tionary changes of expression patterns. 

To trace the link evolution between the networks 
in more detail, wc look at the conditional distribution 
G{a\b) of correlation values aai = a in human given 
a certain correlation value b„u\^ui\ = & in mouse, 
which is shown in Fig. Ofb) as a function of a for 
three different values of b. As expected, the variance 
of G{a\b) is largest for weak correlations and less for 




G(alb) 
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Figure 3: Evolution of co-expression links between human and mouse, (a) The distribution oi p^^{a) of link 
strengths in human, (b) The conditional distribution of link strengths in human, G{a\b), plotted against a for the 
values b = —0.75 (dotted), (dashed), 0.75 (full) in mouse. The heavy solid line shows the conditional distribution 
G{a\b = 0.75) restricted to links within expression clusters, see text, (c) The empirical link scoring function s^{a,b) 
for 6 = -0.75 (dotted), (dashed), 0.75 (full). 



strong positive or negative correlations. The result- 
ing link scoring function s^{a,b) = log[G{a\b)/pj^{a)] 
is a linear function of a with the slope determined by 
6, seeEq. © and Fig.Ofc). 

Conserved network patterns. Co-expression net- 
works are not homogeneous ^1\ Instead, they are 
organized in clusters of genes which have similar ex- 
pression profiles. In the mouse network, we call a 
gene j clustered if it has a correlation bjj> > 0.8 with 
more than 15 other genes (the average number of 
links b > 0.8 is approximately 1 per gene). With this 
definition, there are 40 clustered genes in the net- 
work B (little of the following depends on the exact 
thresholds chosen). The thick line in Fig.|3|[b) shows 
the conditional distribution G{a\b — 0.75) restricted 
to links b = bjf involving a clustered mouse gene 
j. The root mean change of the expression correla- 
tions is Af = 0.22. This is a reduction by a factor of 
two, compared to the distribution G{a\b = 0.75) for 
all genes. This reduced change of expression correla- 
tion for clustered genes translates into a local excess 
link score iQ of /S.S^ ~ 10 per gene. This suggests 
that clustered genes have more strongly conserved 
expression patterns than genes which are not part of 
clusters, see also ref. p. Fig.^fa) shows the link evo- 
lution between a set of clustered genes (arranged in a 
circle) and a randomly chosen set of genes outside this 
cluster (arranged in a straight line). The link inten- 
sity encodes the correlation strength a in human, the 
color its evolutionary conservation as measured by 
the excess link score As^(a, b). Intra-cluster links are, 
on average, stronger (i.e., more intense) and at the 
same time more conserved (i.e., contain more blue) 
than links with external genes. The genes contained 



in this cluster are involved in the control of transcrip- 
tion and code for constituents of the ribosome; their 
full list of is given as Supporting Table. 
Correlations between link and node similarity. 
Figure |S1 shows an overall correlation between cross- 
species sequence similarity quantified by the score of 
the best nucleotide Blast hit |23] and link similarity 
measured by the excess link score AS'^. Gene pairs 
with a high sequence score also have a bias towards 
high link similarity. However, the converse is not 
true: most of the gene pairs with strongly conserved 
expression patterns have only average sequence simi- 
larity. An example is the gene cluster discussed above 
(marked by grey diamonds in fig.O, which has a sig- 
nificant excess link score AS*^ ^ 10 and a sequence 
score of 440 per gene, which is not significantly above 
the network average of 394. 

Network-based gene annotations. Network 
alignment as a putative functional map differs from 
the homology map of individual genes: there are 
genes without an (easily detected) homologous part- 
ner in the other network. These genes are aligned 
solely on the basis of their link score. Although our 
dataset is centered around housekeeping genes and 
may be biased against such cases, the maximum- 
likelihood alignment contains significant cases of such 
link-based alignments, which are reported in the Sup- 
porting Table (and marked by gray circles in fig.[5l). 
(i) Human-ORICI is aligned to mouse-01fr836 
with a local link score S^ = 16.1 exceeding the aver- 
age value 6.7 between orthologs. A functional rela- 
tionship between these genes is quite plausible: Not 
only are both genes involved in olfactory receptor ac- 
tivity |24j. they also have two protein domains in 
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Figure 5: Node versus link evolution. For aligned 
pairs of genes (i,7r(i)), the nucleotide Blast score with 
standard parameters (vertical axis) is plotted against the 
excess link score AS^^,^-, (horizontal axis). Genes in the 
cluster shown in Fig. |1] (gray diamonds) are distinguished 
by high link similarity, but most of them show no en- 
hanced Blast score. The gene pairs (i) and (ii), aligned 
solely on the basis of the link score (see text), are indi- 
cated by gray circles. 



1 



Figure 4: Evolutionary conservation of gene clus- 
ters, (a) 7 genes from a cluster of co-expressed genes to- 
gether with 7 random genes outside this cluster (straight 
line) . Each node represents a pair of aligned genes in hu- 
man and mouse. The intensity and color shading of a link 
encode the correlation value a in human and its relative 
evolutionary conservation between the two species (see 
color bars). This cluster contains the non-orthologous 
aligned gene pair human-HMGNl/mouse-Parp2, predict- 
ing a non-orthologous gene displacement, (b) The same 
cluster, but with human-HMGNl "falsely" aligned to 
its ortholog mouse-HMGNl (left), and human-PARP2 
aligned to its ortholog mouse-Parp2 (right, with the inten- 
sity encoding the correlation in mouse). This mismatch 
shows the poor expression similarity for this pair of genes. 



comnion and belong to the same gene family. How- 
ever, their overall DNA sequence identity is below 
60%, compared to a typical value of 85% between 
orthologs in human and mouse. Most likely these 
genes are distant orthologs, predating the human- 
mouse split. This is an example where functional 
constraints maintain a high level of conservation at 
the network level, but not at the sequence level. 

(ii) In the case human-HMGNl/mouse-Parp2, 
both genes have orthologs but the network align- 



ment does not match the orthology map. As shown 
in Fig. Efa), the human gene HMGNl is part of 
a gene cluster, and the alignment to mouse-Parp2 
(with S^ = 25.1) respects the evolutionary conserva- 
tion of that cluster. The "false" alignment human- 
HMGNl/mouse-HMGNl respects orthology but pro- 
duces a link mismatch (S^ — —12.4) due to the 
poor expression similarity of mouse-HMGNl with the 
other genes of the cluster; see Fig. EJb). Human- 
HMGNl is known to be involved in chromatin modu- 
lation and to act as an RNA-polymerase II transcrip- 
tion factor, in particular through altering the acces- 
sibility of regulatory DNA. The network alignment 
predicts a similar role of Parp2 in mouse, which is 
distinct from its known function in the poly(ADP- 
ribosyl)ation of nuclear proteins. This prediction is 
consistent with a recent experimental study inhibit- 
ing the members of the Parp gene family in mouse. 
The authors conclude that "in addition to known 
functions of poly(ADP-ribosyl)ation, some so far un- 
recognized, non-redundant functions may also ex- 
ist" , specifically the chromatin-remodeling involved 
in gene expression changes during development |25| . 



Discussion 

Alignment provides a quantitative measure 
of network divergence. We have developed a 
probabilistic alignment procedure for biological net- 
works based on their link and node similarity. Both 
components of similarity are important, i.e., a net- 
work alignment differs, in general, both from a 
mere matching of link patterns and from a mere 
node homology map. To the extent that signifi- 
cant sequence homology is present, it clearly intro- 
duces a bias for the functional association of genes 
across organisms, and hence for the alignment. It 
is this bias that makes the problem computation- 
ally tractable: although there is probably no formal 
solution by a polynomial-time algorithm, biological 
network alignment allows for more efhcient heuris- 
tics than generic pattern matching. (We have dis- 
cussed here an alignment of about 2000 genes, but 
ongoing studies suggest the method can be scaled up 
to genome-wide cross-species comparisons of verte- 
brates.) On the other hand, the homology relations 
are not completely respected even between relatively 
close species: network alignment thus predicts a devi- 
ation of functional evolution from sequence evolution 
for some genes. Assessing the statistical significance 
of such functional swaps requires tuning the relative 
weight of link and node similarity in a consistent way, 
which is done here by a Bayesian inference from the 
datasets. 

Cluster conservation and selection. There are 
important differences in the population genetics of 
sequences and networks. Sequence divergence has 
an approximate molecular clock of synonymous nu- 
cleotide changes, which can be described approxi- 
mately by neutral evolution. Adaptive changes can 
be quantified relative to neutral evolution. For inter- 
action networks, the relative weights of neutral evolu- 
tion, negative and positive selection are far less clear. 
Indeed, the role of selection in the evolution of expres- 
sion patterns is currently under debate jJHl 1^ • Even 
the direction of evolution may not be as predomi- 
nantly divergent as for sequences: the selection for 
a given function may lead to convergent evolution of 
network structures. Nevertheless, there is some reg- 
ularity in the evolution of expression patterns: genes 
which are part of a strongly correlated cluster in one 
species have a significantly reduced cross-species vari- 
ation of their expression profile; this conservation is 
quantified by a typical excess link score AS^ of or- 



der 10 per gene. Selection for functionality is indeed 
a possible explanation. However, as the example of 
Fig. ^ shows, selection in a network can be rather 
complex: conservation of a gene cluster as a whole 
could be attributed to purifying selection at the level 
of network interactions, but this does not exclude 
positive selection leading to functional swaps at the 
level of network constituents. 



Network-based prediction of gene function. 

Given a cross-species alignment of gene networks, 
we can quantify link and node evolution. For our 
cross-species analysis between human and mouse, 
the correlations between these two modes are shown 
in Fig. [SI Although high sequence similarity pre- 
dicts high link conservation, most of the gene pairs 
with high link conservation have only average se- 
quence similarity. Hence, the network alignment con- 
tains functional information beyond the correspond- 
ing sequence alignment: it detects evolutionary con- 
servation which is not discernible by a comparison 
of overall similarity between sequences. Identifying 
genes with conserved expression patterns will also 
aid the cross-species analysis of regulatory binding 
sites, where a rapid turnover of binding sites de- 
spite the conservation of expression patterns has been 
found EH). 

Extreme cases of mismatch between link and node 
evolution are gene pairs with significantly similar in- 
teraction patterns but with no significant sequence 
similarity at all. This mismatch can be due to long- 
term sequence evolution between orthologous genes, 
which randomizes their sequence similarity, while 
their functional roles are more conserved. It may 
also arise from link dynamics leading to link similar- 
ities between genes that are completely uncorrelated 
at the sequence level ^l^^E). In our alignment 
of co-expression networks, we find evidence for both 
processes. Thus, the alignment leads to functional 
predictions on the basis of network similarity alone, 
in cases where a functional annotation is known for 
one of the aligned genes. 
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